Internet Info 1997 December




Internet Message Format | 1997-02-19 | 12KB

Received: (from daemon@localhost) by services.bunyip.com (8.6.10/8.6.9)
    id NAA01599 for urn-ietf-out; Fri, 20 Dec 1996 13:53:15 -0500
Received: from mocha.bunyip.com (mocha.Bunyip.Com [192.197.208.1])
    by services.bunyip.com (8.6.10/8.6.9) with SMTP id NAA01592
    for <urn-ietf@services.bunyip.com>; Fri, 20 Dec 1996 13:53:11 -0500
Received: from dicsmss1.jrc.it by mocha.bunyip.com with SMTP
    (5.65a/IDA-1.4.2b/CC-Guru-2b) id AA22339
    (mail destined for urn-ietf@services.bunyip.com);
    Fri, 20 Dec 96 13:52:54 -0500
Received: from jrc.it (elect6.jrc.it) by dicsmss1.jrc.it (4.1/EB-950131-C)
    id AA12960; Fri, 20 Dec 96 19:58:19 +0100
Received: by jrc.it (5.x/EB-950213-L) id AA15610;
    Fri, 20 Dec 1996 19:52:29 +0100
Date: Fri, 20 Dec 1996 19:52:28 +0100 (MET)
From: Dirk.vanGulik@jrc.it
X-Sender: dirkx@elect6.jrc.it
To: Ryan Moats <jayhawk@ds.internic.net>
Cc: urn-ietf@bunyip.com
Subject: Re: [URN] Pre-release of draft-ietf-urn-syntax-02.txt
In-Reply-To: <32B56D9D.7B96@ds.internic.net>
Message-Id: <Pine.SOL.3.91.961220185310.15549B-100000@elect6.jrc.it>
Reply-Path: Dirk.vanGulik@jrc.it
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-urn-ietf@services.bunyip.com
Precedence: bulk
Reply-To: Dirk.vanGulik@jrc.it
Errors-To: owner-urn-ietf@bunyip.com

Just some comments on grandfathering and the % changes.

The quick summary:

0. A plea to make the <urn-set> representation the real thing, while
   still urging the UAs to display/enter/... them with effective
   UTF-8 decoding.

1. Suggested removal of the '%' sign from the list.

2. Suggested a UTF-8 glyph encoding of '%' as an octet for the char
   itself (%25, I think).

3. Suggested an escape hatch for (grandfathered-in) namespaces where
   the glyph representation is confirmed ambiguous (for example the
   $, #, pound and yen signs).

4. Suggested an escape hatch for namespaces where the identifiers have
   no meaningful representation in the URN syntax (at least meaningful
   in the sense that a UA could gain something by showing it to the
   user), such as spaces which have purely binary object identifiers;
   base64 is the _should_ wannabe.

5. Some very SILLY suggestion about leaving the '.' in the NSS
   reserved for future versions or expansion.

6. Some extra requirements on the NSI if it is one of the URI
   prefixes. I can see valid reasons to do this, but I can also see
   easy confusion with the UA interface; so if one proposes such an
   NSI identifier she'd better address/defend it in the publication
   right away. Quick example:

      urn:email:dirk.vangulik@jrc.it
      urn:smtp:Marcel-1.08-1111165644-064SrPH@snoopy.ant.co.uk
      urn:nntp:some-message-id-as-above

   Although one might think these are a bad idea, the syntax and
   resolution allow for them; they do make a bit of sense, and I can
   imagine people who actually have brains dreaming up persistent
   names for resources they already have available under a well-known
   URL.

Have fun.

Dw.

-- URN Syntax
-- Filename: draft-ietf-urn-syntax-02.txt
--
-- 2.1 Namespace Identifier Syntax
--
-- The following is the syntax for the Namespace Identifier.
-- To (a) be consistent with all potential resolution schemes and (b) not
-- put any undue constraints on any potential resolution scheme, the
-- syntax for the Namespace Identifier is:
--
--    <NID> ::= <let-num> [ *<let-num-hyp> ]
--
--    <let-num-hyp> ::= <upper> | <lower> | <number> | "-"
--
--    <let-num> ::= <upper> | <lower> | <number>
--
--    <upper> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" |
--                "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
--                "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" |
--                "Y" | "Z"
--
--    <lower> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" |
--                "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" |
--                "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
--                "y" | "z"
--
--    <number> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
--                 "8" | "9"
--
-- This is slightly more restrictive that what is stated in RFC 1738 [3]
-- (which allows the period "."). Further, the Namespace Identifier is
-- case insensitive, so that "ISBN" and "isbn" refer to the same
-- namespace.

I only recently realized this when we were making an updated version
of a namespace, and we actually federated the namespace identification
itself for a little test, i.e.

   urn:v2.inet:some.where:somestring

which obviously works with the v2.inet.urn.net lookup. I am not sure
whether this is fluff functionality, and I feel that I am going down a
slippery slope, but you 'might' want to reserve the dot somehow for
future expansion in case you need an extra dimension, say because we
do not have the syntax right now; you could do something like

   urn:duns.2:gorbaglobinrealniceunicode

I.e. have an escape option.

-- To avoid confusion with the "urn:" identifier, the NID "urn" is
-- reserved and MUST NOT be used.
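
The quoted NID grammar and the reserved "urn" rule can be sketched as a
small validator. This is only an illustration in Python and is not part
of the draft: the names NID_PATTERN, is_valid_nid and same_namespace
are made up here, and rejecting "urn" in any letter case is an
assumption drawn from the case-insensitivity statement.

```python
import re

# <NID> ::= <let-num> [ *<let-num-hyp> ] as quoted from the draft:
# one letter or digit, then any number of letters, digits or hyphens.
# The period is deliberately absent, matching the quoted grammar.
NID_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def is_valid_nid(nid: str) -> bool:
    """Return True if nid fits the quoted grammar and is not 'urn'."""
    if NID_PATTERN.fullmatch(nid) is None:
        return False
    # Assumption: the reserved NID "urn" is rejected in any letter case,
    # because the draft calls the NID case insensitive.
    return nid.lower() != "urn"


def same_namespace(a: str, b: str) -> bool:
    """Compare two NIDs case-insensitively ("ISBN" equals "isbn")."""
    return a.lower() == b.lower()
```

Under this reading, is_valid_nid("v2.inet") and is_valid_nid("duns.2")
are both false, which is exactly the restriction the remark about the
dot is concerned with.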
After some confusion here, I would like to add something along the
lines of:

   If a proposed namespace identifier is already in use as a protocol
   specifier in the URI space, such as for example ftp, http or email,
   the proposal for this identifier should justify this choice;

i.e.

   urn:email:dirk.vangulik@jrc.it

or

   urn:ftp:someplace:abc.efg
   urn:ftp://someplace/abcd.efg

Because I can see people wanting to re-use some of the existing
well-known protocol specifiers, but that would _really_ cause
confusion, especially when the user agent allows you to type 'ibm' and
really turns it into http://www.ibm.com:80/home.html these days :-)

-- As required by 1737, there is a single canonical representation of
-- the NSS portion of an URN. The format of this single canonical form
-- follows:
--
--    <NSS> ::= *<URN chars>
--
--    <URN chars> ::= <trans> | "%" <hex> <hex>
--
--    <trans> ::= <upper> | <lower> | <number> | <other>
--
--    <hex> ::= <number> | "A" | "B" | "C" | "D" | "E" | "F" |
--              "a" | "b" | "c" | "d" | "e" | "f"
--
--    <other> ::= "(" | ")" | "+" | "" | "," | "-" | "." | "/" |
--                ":" | "=" | "?" | "@" | "%"

This means that a '%' followed by a non-hex character encodes itself,
so

   urn:abc:jan%is%gek

is no problem. But then suppose you _want_ to replace the 'is' by an
'ef', which looks like hex (though you did not intend it). So I would
propose to re-formulate it as:

++    <other> ::= "(" | ")" | "+" | "" | "," | "-" | "." | "/" |
++                ":" | "=" | "?" | "@"

And in the text

-- Depending on the rules governing a namespace, valid identifiers in a
-- namespace might contain characters that are not members of the URN
-- character set above (<URN chars>). Such strings MUST be translated
-- into canonical NSS format before using them as protocol elements or
-- otherwise passing them on to other applications.
-- Translation is done by encoding each character outside the URN character
-- set as a sequence of one to six octets using UTF-8 encoding, and the
-- encoding of each of those octets as "%" followed by two characters from
-- the <hex> character set above. The two characters give the hexadecimal
++ representation of that octet. The percentage sign _itself_ is encoded
++ as an octet index into its UTF-8 representation.

-- Namespaces MAY designate one or more characters from the URN
-- character set as having special meaning for that namespace (An
-- example of this is the "%" character in the URN syntax itself). If
-- the namespace also uses that character in a literal sense as well,
-- the character used in a literal sense MUST be encoded with "%"
-- followed by the hexadecimal representation of that octet. Therefore,
-- the process of registering a namespace identifier shall include
-- publication of a definition of which characters have a special
-- meaning and how to encode these characters if used in a literal
-- sense.

Now this kind of 'clashes' with the need to grandfather in various
schemes, as their glyph encoding might *NOT* be an unambiguous
character index. For example, the indexes or local control identifiers
may be collected over various countries, each of which uses its own
(say Latin) charset, while the actual LCI is just the sequence of the
character index numbers _THEMSELVES_ into the various charsets. So
taking the glyph an index represents in one charset display mapping
and encoding that in UTF-8 would most certainly be wrong for another
one; and if the people interoperate you have broken the LCI scheme.
The reverse also happens; and on top of that, normalizing in UTF-8
makes it even more fun.
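
The effect of dropping "%" from <other> and encoding the percentage
sign itself as %25 can be sketched as follows. This is a hedged
illustration in Python, not the draft's algorithm: TRANS, HEX,
encode_nss and decode_nss are names invented here, the empty ""
entry of the quoted <other> rule is simply left out, and Python's
UTF-8 encoder emits at most four octets per character rather than
the "one to six" of the draft text.

```python
# Characters allowed unescaped: <upper>, <lower>, <number> and the
# proposed <other> set, i.e. WITHOUT "%". (Assumption: the empty ""
# entry in the quoted grammar is ignored.)
TRANS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "()+,-./:=?@"
)
HEX = set("0123456789ABCDEFabcdef")


def encode_nss(text: str) -> str:
    """Escape every character outside TRANS as %XX per UTF-8 octet."""
    out = []
    for ch in text:
        if ch in TRANS:
            out.append(ch)
        else:
            # A literal "%" lands here too and becomes "%25".
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


def decode_nss(nss: str) -> str:
    """Reverse encode_nss; a '%' must always start a two-hex escape."""
    raw = bytearray()
    i = 0
    while i < len(nss):
        ch = nss[i]
        if ch == "%":
            pair = nss[i + 1:i + 3]
            if len(pair) != 2 or any(c not in HEX for c in pair):
                raise ValueError(f"bad escape at position {i}")
            raw.append(int(pair, 16))
            i += 3
        elif ch in TRANS:
            raw.append(ord(ch))
            i += 1
        else:
            raise ValueError(f"character {ch!r} is not in the URN set")
    return raw.decode("utf-8")
```

With "%" always escaped, the identifier "jan%ef%gek" is carried as
"jan%25ef%25gek", so the 'ef' can no longer be mistaken for a hex
escape, and a bare "jan%is%gek" is rejected on decoding instead of
being guessed at.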
So for this reason I suggest the following change of wording:

-- Depending on the rules governing a namespace, valid identifiers in a
-- namespace might contain character indexes that are not members of the
++ URN character set above or have a different glyph representation in the
++ above charset (<URN chars>). Such strings MUST be translated
-- into canonical NSS format before using them as protocol elements or
-- otherwise passing them on to other applications. Translation is done
-- by encoding each character index outside the URN character set and/or
++ with a different glyph interpretation. If the glyph and indexes
++ representation is unambiguous then it SHOULD be encoded as a
++ sequence of one to six octets using UTF-8 encoding, and the encoding
-- of each of those octets as "%" followed by two characters from the
-- <hex> character set above. The two characters give the hexadecimal
++ representation of that octet. The percentage glyph _itself_ is encoded
++ as an octet index into the UTF-8 representation.

As an aside: could someone more knowledgeable than me add something
about normalization of the UTF-8 string, as this does aid the encoding
when a user is allowed to enter a URN from a keyboard with keys other
than those corresponding to the symbols of <urn-set>.

-- Namespaces whose object identifiers do not have a mapping to an
-- information conveying representation SHOULD use a base64 encoding
-- of the binary encoded identifier.

-- Namespaces MAY designate one or more characters from the URN
-- character set as having special meaning for that namespace (An
-- example of this is the "%" character in the URN syntax itself). If
-- the namespace also uses that character in a literal sense as well,
-- the character used in a literal sense MUST be encoded with "%"
-- followed by the hexadecimal representation of that octet.
-- Therefore, the process of registering a namespace identifier shall
-- include publication of a definition of which characters have a special
-- meaning and how to encode these characters if used in a literal
++ sense. Likewise if glyph ambiguity does not allow for UTF8 encoding
++ the publication should outline the possible ambiguity and
++ representations possible for display.

By the way, I do feel quite strongly about glyph ambiguity and the
ability to encode arbitrary strings. I would be even happier if each
name scheme essentially defined _ITSELF_ what glyph encoding it uses,
and we only suggested extremely strongly that UTF is the way to go.
But the bottom line is that:

++ The URN is a sequence of glyphs taken from the <urn-set>. It is
++ represented by an octet stream with only the values of the
++ indexes of the <urn-set> glyphs in UTF-8. A glyph or symbol encoded,
++ entered, stored, displayed or conveyed which is not in this limited
++ range, for example with an UTF-8 encoding, is purely based on the
++ mutual understanding of the publishing and utilizing party; but is
++ not to be used to represent the URN.

The above needs to be tackled anyway, somewhere, unless I missed the
place in the current draft which tells us which is really the
authoritative representation (between machines :-) of the beast.
(Ignoring normalization, etc.)

Well, Merry Christmas,

Dw.
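
The glyph ambiguity argued for above can be made concrete with a small
Python sketch. Everything in it is an assumption chosen for
illustration: the two charsets (ISO-8859-1 and ISO-8859-15) and the
index 0xA4 are not from the message, and utf8_escape and raw_escape
are hypothetical helper names. Escaping the raw index is only one
possible reading of the "escape hatch", not something the message
specifies.

```python
def utf8_escape(index: int, charset: str) -> str:
    """Map a one-octet character index to a glyph via charset, then
    escape the UTF-8 octets of that glyph as %XX."""
    glyph = bytes([index]).decode(charset)
    return "".join(f"%{octet:02X}" for octet in glyph.encode("utf-8"))


def raw_escape(index: int) -> str:
    """Hypothetical alternative: escape the character index itself,
    with no charset interpretation at all."""
    return f"%{index:02X}"


# The same stored index yields different URN text depending on which
# national charset the encoder assumes:
latin1_form = utf8_escape(0xA4, "iso-8859-1")    # currency sign
latin9_form = utf8_escape(0xA4, "iso-8859-15")   # euro sign
index_form = raw_escape(0xA4)
```

Here latin1_form and latin9_form differ although both parties started
from the identical index, which is the interoperability break
described for the LCI case; index_form stays the same whichever
charset either party displays it in.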